Computational Biology and Chemistry
Elsevier BV
All preprints, ranked by how well they match Computational Biology and Chemistry's content profile, based on 23 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Bello, A. J.; Omotuyi, A. O.; Oladapo, O. B.; Adekunle, A. A.; Udechime, K. U.; Akinwande, A. B.; Odewale, A. F.; Adamson, O. O.; Eban, D. O.; Folarin, O.; Okpuzor, J.; Minari, J. B.
Synonymous codon substitution, a gene engineering approach in synthetic biology, has been effective in improving the codon composition of recombinant genes of interest based on various criteria without altering the amino acid sequence. The SARS-CoV-2 nucleocapsid (N) protein is a stable, conserved and highly immunogenic protein that is less prone to mutation during infection, making it a key antigen in in vitro diagnosis, vaccine development, and immunological and structural studies. While reports have focused on applying optimized N protein for different applications, the basic parameters used by different optimization tools for choosing the best approach to synonymous codon substitution of the N gene are often neglected. Here, we analyzed the influence of different synonymous codon substitution strategies on SARS-CoV-2 N-protein expression in E. coli. Using different codon optimization (CO) and codon harmonization (CH) tools, we predicted and compared how parameters such as GC content, Codon Adaptation Index, codon quality and the number of rare codons present in these sequences affect N-protein expression. Our results also show that the Minimum Free Energy (MFE) and RNA structure of the N-term and C-tail of the N-protein coding sequence influence protein folding. We then predicted that the SR-rich region of the N-protein may contribute to slowing down the elongation rate during translation. This work presents a fundamental analysis of how different optimization tools affect SARS-CoV-2 N-protein expression and folding, and suggests a basic approach to choosing the best strategy for optimal expression and folding of the protein for further studies.
[Graphical Abstract: Figure 1]
Highlights: Codon substitution affects SARS-CoV-2 nucleocapsid (N) protein expression. SARS-CoV-2 N-protein expression varies with different codon optimization tools. RNA structures of N- and C-term impact RNA stability, protein expression and folding. SR-rich region of the N-protein may slow down elongation rate during translation.
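Two of the sequence parameters named in the abstract above, GC content and the number of rare codons, are simple to compute. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the rare-codon set is an illustrative assumption (codons commonly described as rare in E. coli) that should be replaced by a codon-usage table for the actual expression host.

```python
# Illustrative assumption: a small set of codons often described as rare in E. coli.
# Replace with values derived from a real codon-usage table for your host strain.
RARE_CODONS = frozenset({"AGG", "AGA", "CGA", "CTA", "ATA", "CCC"})


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def count_rare_codons(seq: str, rare=RARE_CODONS) -> int:
    """Count in-frame codons (reading frame 0) that belong to the rare set."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return sum(codon in rare for codon in codons)


if __name__ == "__main__":
    toy = "ATGAGGCTACCCGGT"  # made-up sequence: ATG AGG CTA CCC GGT
    print(f"GC content: {gc_content(toy):.2f}")
    print(f"Rare codons: {count_rare_codons(toy)}")
```

Comparing these two numbers across the outputs of several optimization tools gives a quick first view of how differently the tools rewrite the same coding sequence.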
Ghafari, M. D.; Rasooli, I.; Khajeh, K.; Dabirmanesh, B.; Ghafari, M.; Owlia, P.
Predicting the phase transition temperature (Tt) of Elastin-like polypeptides (ELPs) is not trivial because it depends on a complex set of variables such as composition, sequence length, hydrophobic and hydrophilic character, the sequence order in the fused proteins, linkers and trailer constructs. In this paper, two quantitative models are presented for predicting the Tt of a family of ELPs that could be fused to different proteins, linkers, and trailers. Besides their high accuracy and speed, the models have the advantage of requiring neither multiple software packages, nor peptide information such as a PDB file, nor knowledge of the secondary and tertiary structures of the proteins. One of our models predicts the Tt values of the fused ELPs from the protein, linker, and trailer features with R2 = 99%. The other model predicts the Tt value from the fused protein feature alone with R2 = 96%. For more reliability, our method is enriched by Artificial Intelligence (AI) to generate similar proteins: a Generative Adversarial Network (GAN) is used to create synthetic proteins and similar values. The experimental results show that our strategy for predicting Tt is reliable on large data.
Author Summary: Elastin-like polypeptides (ELPs) are now used as protein tags in a variety of biotechnology applications, especially protein purification and drug delivery. Their use as protein tags relies on ELPs retaining their phase transition behavior when fused to other proteins, linkers and tags. ELPs undergo a phase transition from a soluble phase to an insoluble phase above their inverse transition temperature (Tt) within a short time span. This biophysical behavior is usually reversible at temperatures below the Tt. Few reports have evaluated the Tt of different ELP types, and those that exist use dissimilar equations and algorithms. Our current predictions, which account for protein, linker and trailer effects, are the most accurate calculations presented so far, and the results were evaluated against the available experimental data. Furthermore, our results show that our strategy for predicting Tt is reliable on large data.
Lu, Y.; Huang, Y.; Li, T.
In biological systems, protein-protein interactions (PPIs) weave intricate network patterns that are fundamental to the structural and functional integrity of organisms. While the majority of existing research has been anchored in the study of pairwise PPIs, the realm of high-order interactions remains relatively untapped. This oversight could potentially obscure the deeper intricacies embedded within biological networks. To address this gap, this study formulates a scientific task aimed at predicting high-order protein-protein interactions and introduces a multi-level comprehensive dataset focused on triadic high-order interactions within PPI networks. This dataset incorporates more than 80% of the known human protein interaction relationships and is partitioned into 60 subsets across a diverse range of functional contexts and confidence levels. Through meticulous evaluation using cutting-edge high-order network prediction tools and benchmark PPI prediction methodologies, our findings resonate with the principle that "more is different". Triadic high-order interactions offer a more enriched and detailed informational canvas than their pairwise counterparts, paving the way for a deeper comprehension of the intricate dynamics at play in biological systems. In summary, this research accentuates the critical importance of high-order PPIs in biological systems and furnishes invaluable resources for subsequent scholarly investigations. The dataset is poised to catalyze future research endeavors in protein-protein interaction networks, elucidating their pivotal roles in both health and disease states.
Yan, Z.; Omori, S.; Yamada, K. D.; Nishi, H.; Kinoshita, K.
The biological functions of proteins are traditionally thought to depend on well-defined three-dimensional structures, but many experimental studies have shown that disordered regions lacking fixed three-dimensional structures also have crucial biological roles. In some of these regions, disorder-order transitions are also involved in various biological processes, such as protein-protein interaction and ligand binding. Therefore, it is crucial to study disordered regions and structural transitions for further understanding of protein functions and folding. Owing to the costs and time requirements of experimental identification of natively disordered or transitional regions, the development of effective computational methods is a key research goal. In this study, we used overall residue dependencies and deep representation learning for prediction and reused the obtained disordered regions for the prediction of disorder-order transitions. Two similar and related prediction tasks were combined. Firstly, we developed a novel deep learning method, Res-BiLstm, for residue-wise disordered region prediction. Our method outperformed other predictors with respect to almost all criteria, as evaluated using an independent test set. For disorder-order transition prediction, we proposed a transfer learning method, Res-BiLstm-NN, with an acceptable but unbalanced performance, yielding reasonable results. To grasp the underlying biophysical principles of disorder-order transitions, we performed qualitative analyses on the obtained results and discovered that most transitions have strong disordered or ordered preferences, and that more transitions are consistent with the ordered state than the disordered state, contrary to conventional wisdom. To the best of our knowledge, this is the first sizable-scale study of transition prediction.
Availability: https://github.com/Yanzziang/Transition_Disorder_Prediction
Contact: kengo@ecei.tohoku.ac.jp
Yang, Z.; Shao, W.; Matsuda, Y.; Song, L.
Motivation: Despite the development of several computational methods to predict DNA modifications, two main limitations persist in current methodologies: 1) All existing models are binary predictors, which merely determine the presence or absence of a DNA modification and so constrain comprehensive analyses of the interrelations among varied modification types. While multi-class classification models for RNA modifications have been developed, a comparable approach for DNA remains a critical need. 2) The majority of previous studies lack adequate explanations of how models make decisions; they rely on the extraction and visualization of attention matrices, which identified few motifs, and do not provide sufficient insight into the model decision-making process.
Results: In this study, we introduce iResNetDM, a deep learning model that integrates ResNet and self-attention mechanisms. To the best of our knowledge, iResNetDM is the first model capable of distinguishing between four types of DNA modifications. It not only demonstrates high performance across various DNA modifications but also unveils the potential capabilities of CNN and ResNet in this domain. To augment the interpretability of our model, we implemented the integrated gradients technique, which was pivotal in demystifying the model's decision-making framework, allowing for the successful identification of multiple motifs. Importantly, our model exhibits remarkable robustness, successfully identifying unique motifs across different modifications. Furthermore, we compared the motifs discovered in various modifications, revealing that some motifs share significant sequence similarities, which suggests that these motifs may be subject to different types of modifications, underscoring their potential importance in gene regulation.
Contact: zeruiyang2-c@my.cityu.edu.hk
Tejera-Nevado, P.; Serrano, E.; Gonzalez-Herrero, A.; Bermejo-Moreno, R.; Rodriguez-Gonzalez, A.
The determination of protein structure has been facilitated using deep learning models, which can predict protein folding from protein sequences. In some cases, the predicted structure can be compared to the already-known distribution if there is information from classic methods such as nuclear magnetic resonance (NMR) spectroscopy, X-ray crystallography, or electron microscopy (EM). However, challenges arise when the proteins are not abundant, their structure is heterogeneous, and protein sample preparation is difficult. To determine the level of confidence that supports the prediction, different metrics are provided. These values are important in two ways: they offer information about the strength of the result and can supply an overall picture of the structure when different models are combined. This work provides an overview of the different deep-learning methods used to predict protein folding and the metrics that support their outputs. The confidence of the model is evaluated in detail using two proteins that contain four domains of unknown function.
Middendorf, L.; Eicholt, L. A.
Understanding the emergence and structural characteristics of de novo and random proteins is crucial for unraveling protein evolution and designing novel enzymes. However, experimental determination of their structures remains challenging. Recent advancements in protein structure prediction, particularly with AlphaFold2 (AF2), have expanded our knowledge of protein structures, but their applicability to de novo and random proteins is unclear. In this study, we investigate the structural predictions and confidence scores of AF2 and protein language model (pLM)-based predictor ESMFold for de novo, random, and conserved proteins. We find that the structural predictions for de novo and random proteins differ significantly from conserved proteins. Interestingly, a positive correlation between disorder and confidence scores (pLDDT) is observed for de novo and random proteins, in contrast to the negative correlation observed for conserved proteins. Furthermore, the performance of structure predictors for de novo and random proteins is hampered by the lack of sequence identity. We also observe varying predicted disorder among different sequence length quartiles for random proteins, suggesting an influence of sequence length on disorder predictions. In conclusion, while structure predictors provide initial insights into the structural composition of de novo and random proteins, their accuracy and applicability to such proteins remain limited. Experimental determination of their structures is necessary for a comprehensive understanding. The positive correlation between disorder and pLDDT could imply a potential for conditional folding and transient binding interactions of de novo and random proteins.
Medeiros, V. M.; Pearl, J. M.; Carboni, M.; Er, E. M.; Zafeiri, S.
The prediction of tertiary RNA structures is significant to the field of medicine (e.g. mRNA vaccines, genome editing) and to the exploration of viral transcripts. Though many RNA folding software packages exist, few studies have focused solely on viral pseudoknotted RNA. These regulatory pseudoknots play a role in genome replication, gene expression, and protein synthesis. This study explores five RNA folding engines that compute either the minimum free energy (MFE) or the maximum expected accuracy (MEA). These folding engines were tested against 26 experimentally derived short pseudoknotted sequences (20-150 nt) using metrics that are commonly applied to software prediction accuracy (e.g. F1 scoring, PPV). This paper reports that some RNA prediction engines, such as pKiss, achieve higher accuracy than previous iterations of the software and than older folding engines. The results show that MEA folding software does not always outperform MFE folding software in prediction accuracy when assessed with metrics such as percent error, sensitivity, PPV, and F1 scoring on viral pseudoknotted RNA. Moreover, the results suggest that thermodynamic model parameters will not ensure accuracy if auxiliary parameters such as Mg2+ binding, dangling end options, and H-type penalties are not applied. The observations reported in this paper highlight the differences in quality between ab initio prediction methods while reinforcing the idea that a better understanding of intracellular thermodynamics is necessary for a more efficacious screening of RNAs.
Importance: The importance of accurately predicting RNA structures cannot be overstated, particularly in the context of viral biology and the development of therapeutic interventions such as mRNA vaccines and genome editing.
Our study addresses the gap in the existing literature by concentrating solely on viral pseudoknotted RNA, which plays a crucial role in viral replication, gene expression, and protein synthesis. Our study sheds light on the debate surrounding minimum free energy (MFE) versus maximum expected accuracy (MEA) models in RNA folding predictions. Contrary to existing beliefs, we found that MEA models do not consistently outperform MFE models, especially in the context of viral pseudoknotted RNAs. Our research contributes to advancing the field of computational biology by providing insights into the efficacy of different prediction methods and emphasizing the need for a deeper understanding of intracellular thermodynamics to improve RNA structure predictions.
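The accuracy metrics named in this abstract (sensitivity, PPV, F1) are typically computed over sets of base pairs. The sketch below is an illustration under stated assumptions, not the evaluation code used in the study: it assumes structures are given in dot-bracket notation where `()` marks nested pairs and `[]` marks a pseudoknot, and the function names are hypothetical.

```python
def parse_pairs(structure: str) -> set:
    """Extract base pairs (i, j) from dot-bracket notation.

    '(' / ')' and '[' / ']' are tracked on separate stacks so that crossing
    (pseudoknotted) pairs are recovered. Unbalanced input raises IndexError.
    """
    openers = {"(": ")", "[": "]"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = set()
    for i, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(i)
        elif ch in closers:
            pairs.add((stacks[closers[ch]].pop(), i))
    return pairs


def pair_metrics(predicted: str, reference: str) -> tuple:
    """Return (PPV, sensitivity, F1) of predicted base pairs against a reference."""
    pred, ref = parse_pairs(predicted), parse_pairs(reference)
    true_pos = len(pred & ref)
    ppv = true_pos / len(pred) if pred else 0.0
    sensitivity = true_pos / len(ref) if ref else 0.0
    total = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / total if total else 0.0
    return ppv, sensitivity, f1


if __name__ == "__main__":
    # Toy example: the prediction recovers one of the two reference pairs.
    ppv, sens, f1 = pair_metrics("(....)", "((..))")
    print(f"PPV={ppv:.2f} sensitivity={sens:.2f} F1={f1:.2f}")
```

Because PPV penalizes spurious pairs and sensitivity penalizes missed pairs, reporting both alongside F1 shows whether an engine over-predicts or under-predicts pairing.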
Marques, L. F.; Wolf, I. R.; Lazari, L. C.; Almeida, L. F. d.; Schnepper, A. P.; Cardoso, L. H.; Moraes, L. N. d.; Grotto, R. M. T.; Simoes, R. P.; Ramos, E.; Valente, G. T.
Ethanol disturbs the cell cycle, transcription, translation, protein folding, cell wall, membranes, and many Saccharomyces cerevisiae metabolic processes. Long non-coding RNAs (lncRNAs) are regulatory molecules that bind to the genome or to proteins. Few lncRNAs have been described for yeast, and little is known about their roles in the system. In particular, it is not known how lncRNAs respond to ethanol in yeast or whether they act in ethanol tolerance. Hence, using RNA-Seq data from S. cerevisiae strains with different ethanol tolerance phenotypes, we identified lncRNAs responsive to severe ethanol stress. We modeled how they participate in ethanol tolerance by analyzing lncRNA-protein interactions. The results showed that the ethanol-tolerance-responsive lncRNAs, in both higher tolerant and lower tolerant phenotypes, work on different pathways: cell wall, cell cycle, growth, longevity, cell surveillance, ribosome biogenesis, intracellular transport, trehalose metabolism, transcription, and nutrient shifts. In summary, lncRNAs seem to interconnect essential system modules to overcome ethanol stress. Finally, we also report the most extensive catalog of lncRNAs in yeast.
Celestini, A.; Cianfriglia, M.; Mastrostefano, E.; Palma, A.; Castiglione, F.; Tieri, P.
Network-based ranking methods (e.g., centrality analysis) have found extensive use in systems biology and network medicine for the prediction of essential proteins, for the prioritization of drug target candidates in the treatment of several pathologies, in biomarker discovery, and for the identification of human disease genes. Here we studied the connectivity of the human protein-protein interaction network (i.e., the interactome) to find the nodes whose removal has the heaviest impact on the network, i.e., maximizes its fragmentation. Such nodes are known as Critical Nodes (CNs). Specifically, we implemented a Critical Node Heuristic (CNH) and compared its performance against four other heuristics based on well-known centrality measures. To better understand the structure of the interactome, the role CNs play in the network, and the ability of the different heuristics to capture biologically relevant nodes, we compared the sets of nodes identified as CNs by each heuristic with two experimentally validated sets of essential genes, i.e., genes whose removal affects a given organism's ability to survive. Our results show that classical centrality measures (i.e., closeness centrality, degree) found more essential genes than CNH on the current version of the human interactome. However, the removal of such nodes does not have the greatest impact on interactome connectivity, while, interestingly, the genes identified by CNH show peculiar characteristics from both the topological and the biological point of view. Finally, even if a relevant fraction of essential genes is found via the classical centrality measures, the same measures seem to fail in identifying the whole set of essential genes, suggesting once again that some of them are not central in the network, that there may be biases in the current interaction data, and that different, combined graph-theoretical and other techniques should be applied for their discovery.
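The contrast drawn above, between nodes that are central and nodes whose removal fragments the network most, can be illustrated on a toy graph. The following is a generic greedy sketch and not the authors' CNH: it assumes fragmentation is measured by pairwise connectivity (the number of node pairs still joined by a path), and the graph, names and functions are all hypothetical.

```python
from collections import deque

# Hypothetical toy network: two triangles (a-b-c and e-f-g) joined through node d.
TOY_GRAPH = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}


def component_sizes(adj, removed=frozenset()):
    """Sizes of connected components after deleting the nodes in `removed` (BFS)."""
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sizes


def pairwise_connectivity(adj, removed=frozenset()):
    """Number of node pairs that remain connected; lower means more fragmented."""
    return sum(s * (s - 1) // 2 for s in component_sizes(adj, removed))


def greedy_critical_nodes(adj, k):
    """Greedily pick k nodes, each time removing the one that fragments the graph most."""
    removed = set()
    for _ in range(k):
        candidates = [n for n in adj if n not in removed]
        best = min(candidates, key=lambda n: pairwise_connectivity(adj, removed | {n}))
        removed.add(best)
    return removed


if __name__ == "__main__":
    print("Greedy critical node:", greedy_critical_nodes(TOY_GRAPH, 1))
    print("Highest-degree nodes:", [n for n in TOY_GRAPH if len(TOY_GRAPH[n]) == 3])
```

In this toy graph the bridge node `d` has the lowest degree among the three inner nodes, yet removing it splits the network into two triangles, whereas removing a degree-3 node (`c` or `e`) leaves more pairs connected. The greedy loop recomputes connectivity for every candidate, so it is only practical for small graphs.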
Warrier, V. P.; Paruchuri, A.; Ramasamy, S.; Gromiha, M. M.; Karunagaran, D.
Epidermal Growth Factor Receptor (EGFR) signaling is known to play essential roles in growth and development; nevertheless, overexpression and mutation of EGFR have been reported in several cancers. Non-small cell lung cancer (NSCLC), the most observed type of lung cancer, harbors the highest number of EGFR tyrosine kinase mutations and therefore, EGFR has become an important therapeutic target for treatment of these tumors. Tyrosine Kinase Inhibitors (TKIs) are found to be effective in patients whose tumors contain activating mutations in the tyrosine kinase region of the receptor. This would seem to be beneficial in the treatment of EGFR mutation-positive NSCLC patients but the activating mutations should be sensitive to TKIs. Earlier, a machine learning approach was developed to classify single amino acid polymorphisms (SAPs) in EGFR into driver (cancer-causing) and passenger (neutral) mutations using structural and functional features (Anoosha et al., 2015). This study screened all possible point mutations in EGFR and predicted a list of mutations with high probability of being a driver or a passenger. From this list, we selected 2 mutations (G729E and G719F) with high evolutionary conservation score for in vitro validation. If proven to be oncogenic drivers and sensitive to EGFR TKIs, these mutations can aid in the early diagnosis and successful therapy of EGFR mutation-positive NSCLC.
Verma, M.
Lung cancer (LC) remains a significant global health concern, affecting millions worldwide each year. Tumor-infiltrating immune cells (TIICs) play a crucial role in lung cancer progression and prognosis, with various immune cell types infiltrating the tumor microenvironment. Traditional methods like immunohistochemistry and flow cytometry have limitations in accurately profiling TIIC subtypes. However, recent advancements in single-cell RNA sequencing and computational algorithms like CIBERSORTx offer a promising approach for characterizing TIICs in bulk tumor samples. In this study, we validated the signature matrix comprising 14 distinct immune cell types and subtypes, which was originally derived from PBMC single-cell RNA-seq data in our previous work (Verma, 2024). The positive controls included 8 bulk RNA-seq samples of whole blood and specific immune cell bulk RNA-seq samples, while the negative control comprised neuroblastoma cell lines lacking immune content. Subsequently, we applied this signature matrix to deconvolute TCGA-LUAD data (n = 598), and assessed tumor purity and immune-stromal content using the ESTIMATE algorithm. Our findings indicate that the signature matrix accurately reflected flow cytometry-derived fractions, supported by correlation analysis. Specifically, the second positive control and negative control accurately reflected immune and non-immune sample fractions, respectively, further validating the efficacy of our approach. This study also provides insights into the invasion of immunocytes in lung adenocarcinoma and highlights the potential of computational tools like CIBERSORTx and ESTIMATE in characterizing the immune microenvironment of LC.
Lin, T.-T.; Yang, L.-Y.; Lu, I.-H.; Cheng, W.-C.; Hsu, Z.-R.; Chen, S.-H.; Lin, C.-Y.
Motivation: Antimicrobial peptides (AMPs) are innate immune components that have aroused a great deal of interest among drug developers recently, as they may become a substitute for antibiotics. However, AMP discovery through traditional wet-lab research is expensive and inefficient. Thus, we developed AI4AMP, a user-friendly web server that provides an accurate prediction of the antimicrobial activity of a given protein sequence, to accelerate the process of AMP discovery.
Results: Our results show that our prediction model is superior to the existing AMP predictors.
Availability: AI4AMP is freely accessible at http://symbiosis.iis.sinica.edu.tw/PC_6/
Contact: cylin@iis.sinica.edu.tw
Rizvi, S. M.; Zheng, W.; Zhang, C.; Zhang, Y.
Myoglobin is the major oxygen-carrying protein of vertebrate muscle, and high myoglobin net charge is known to hold evolutionary significance as a molecular signature of secondarily aquatic diving capacity in mammals. However, the evolution of myoglobin's electrostatic properties in non-mammalian vertebrates, such as birds, has not been investigated. Here, we used a new deep learning-based protein folding algorithm to model the tertiary structures of myoglobin from 302 vertebrate species and performed a comparative analysis of their net charge, positively charged solvent-accessible surface area, and negatively charged solvent-accessible surface area. For avian myoglobins, we also calculated selection pressure (ω). The results suggest that the myoglobins of diving avians, specifically those of the penguins (Sphenisciformes) and diving ducks (Aythyini), have highly positively charged electrostatic surfaces, which evolved via positive selection to reduce aggregation propensity and allow greater storage of oxygen for extended underwater foraging. In contrast, galliform myoglobins are under high purifying selection. The distribution of charged atoms on the myoglobin surface was more indicative of high myoglobin content than net charge. We also found inter-class differences in net charge; bird myoglobins are the most positively charged and reptile and amphibian myoglobins are the most negatively charged, and net charge seems to be negatively associated with herbivory within mammals. Finally, we propose an equation that describes the relationship between myoglobin net charge and concentration better than the previously suggested logarithmic function. Our findings offer novel insights into the diversification of myoglobin in vertebrate clades and highlight the power of computational structural approaches for zoological and evolutionary research.
Tsenum, J. L.
Long non-coding RNAs (lncRNAs) can perform their regulatory roles by forming triple helices through RNA-DNA interaction. Although this has been verified by a few in vivo and in vitro methods, in silico approaches that predict the potential of lncRNAs and DNA sites to form a triplex structure are required. Triplexator has also predicted vast numbers of lncRNAs and DNA sites that have the potential to form a triplex structure. There is also emerging experimental evidence that the presence of epigenetic marks at DNA sites and lncRNAs can facilitate the formation of RNA:DNA triplex structures. There is therefore a huge demand for computational approaches, such as deep learning, that can make novel predictions about RNA:DNA triplex structure formation. In this study, we developed four deep neural network models that can predict the potential of lncRNAs and DNA sites to form triple helices genome-wide, taking histone modification marks as features. Our data was first passed through Triplexator to screen out lncRNAs and DNA sites with low potential to form triple helices. We used different deep learning architectures to build our models, including two-layer convolutional neural networks (CNN) and multilayer perceptrons (MLP). Our DNA2_CNN model performed best, with a mean AUC of 0.78 at a kernel size of 32 and a learning rate of 1e-3. Our deep neural network models revealed several novel lncRNAs and DNA sites, including HOTAIR, MEG3, PARTICLE, DACOR1, MIR100HG, FENDRR, ANRIL, TUG1, MALAT1, LINC00599, TINCR, NEAT1, roX2, DHFR, OTX2-AS1, Xist, SNHG16, ATXN8OS, BCYRN1, TERC, and Khps1, that have the potential to form triplex structures, thereby confirming previous experimental results and those of Triplexator. The performance of our models also supports previous findings that histone modification marks can help in identifying lncRNAs and DNA regions that have the potential to form RNA:DNA triplex structures.
In conclusion, we showed that different deep learning architectures can recognize lncRNAs and DNA sites that have the potential to form RNA:DNA triplex structures.
Farheen, F.; Broyles, B. K.; Zhang, Y.; Ibtehaz, N.; Erkine, A. M.; Kihara, D.
Analysis of factors that lead to the functionality of transcriptional activation domains remains a crucial and yet challenging task owing to the significant diversity in their sequences and their intrinsically disordered nature. Almost all existing methods that have aimed to predict activation domains have involved traditional machine learning approaches, such as logistic regression, that are unable to capture complex patterns in data or plain convolutional neural networks and have been limited in exploration of structural features. However, there is a tremendous potential in the inspection of the structural properties of activation domains, and an opportunity to investigate complex relationships between features of residues in the sequence. To address these, we have utilized the power of graph neural networks which can represent structural data in the form of nodes and edges, allowing nodes to exchange information among themselves. We have experimented with two kinds of graph formulations, one involving residues as nodes and the other assigning atoms to be the nodes. A logistic regression model was also developed to analyze feature importance. For all the models, several feature combinations were experimented with. The residue-level GNN model with amino acid type, residue position, acidic/basic/aromatic property and secondary structure feature combination gave the best performing model with accuracy, F1 score and AUROC of 97.9%, 71% and 97.1% respectively which outperformed other existing methods in the literature when applied on the dataset we used. Among the other structure-based features that were analyzed, the amphipathic property of helices also proved to be an important feature for classification. Logistic regression results showed that the most dominant feature that makes a sequence functional is the frequency of different types of amino acids in the sequence. 
Our results have consistently shown that functional sequences have more acidic and aromatic residues, whereas basic residues are seen more in non-functional sequences.
Sun, T.; Li, Q.; Xu, Y.; Zhang, Z.; Lai, L.; Pei, J.
The liquid-liquid phase separation (LLPS) of biomolecules in cells underpins the formation of membraneless organelles, which are condensates of protein, nucleic acid, or both, and play critical roles in cellular functions. The dysregulation of LLPS might be implicated in a number of diseases. Although the LLPS of biomolecules has been investigated intensively in recent years, knowledge of the prevalence and distribution of phase separation proteins (PSPs) still lags behind. Development of computational methods to predict PSPs is therefore of great importance for a comprehensive understanding of the biological function of LLPS. Here, a sequence-based prediction tool using machine learning for LLPS proteins (PSPredictor) was developed. Our model achieves a maximum 10-CV accuracy of 96.03%, and performs much better in identifying new PSPs than previously reported PSP prediction tools. As far as we know, this is the first attempt to make a direct and more general prediction of LLPS proteins based only on sequence information.
Wu, L.; Pan, X.; Xia, X.
Protein structure resolution has lagged far behind sequence determination, as it is often laborious and time-consuming, and more often than not impossible, to resolve an individual protein structure. For computational prediction, due to the lack of detailed knowledge of the forces driving folding, how to design an energy function is still an open question. Furthermore, an effective criterion to evaluate the performance of the energy function is also lacking. Here we present a novel knowledge-based energy scoring function that simply considers the interactions of peptide bonds, rather than, as is conventional, the residues or atoms, as the most important energy contribution. This energy scoring function was evaluated by its ability to select the X-ray structure from a large number of possibilities. It not only outperforms the best of the previously published statistical potentials, but also has very low computational expense. In addition, we suggest an alternative criterion to evaluate the performance of the energy scoring function, measured by the template modeling score of the selected rank-one structure. We argue that the comparison should allow for some deviation between the X-ray and predicted structures. Collectively, this accurate and simple energy scoring function, together with the optimized criterion, will significantly advance computational protein structure prediction.
Nithin, C.; Mukherjee, S.; Basak, J.; Bahadur, R. P.
Non-coding RNAs (ncRNAs) are major players in the regulation of gene expression. This study analyses seven classes of ncRNAs in plants using sequence and secondary structure-based RNA folding measures. We observe distinct regions in the distribution of AU content along with overlapping regions for different ncRNA classes. Additionally, we find similar averages for minimum folding energy index across the various ncRNA classes except for pre-miRNAs and lncRNAs. Various RNA folding measures show similar trends among the different ncRNA classes except for pre-miRNAs and lncRNAs. We observe different k-mer repeat signatures of length three among the various ncRNA classes. However, in pre-miRNAs and lncRNAs, a diffuse pattern of k-mers is observed. Using these attributes, we train eight different classifiers to discriminate the various ncRNA classes in plants. Support-vector machines employing a radial basis function show the highest accuracy (average F1 of ~91%) in discriminating ncRNAs, and the classifier is implemented as a web server, NCodR.
Moshfeghnia, M.; Jalili, H.; Marashi, S.-A.
Chinese hamster ovary (CHO) cells are a multipurpose, high-performance cell line for recombinant protein production in the biopharmaceutical industry. They have proven their ability to produce a wide range of therapeutic proteins with high efficiency and quality. Designing novel, high-performance CHO cell lines can have a substantial impact on the biopharmaceutical industry by reducing prices and increasing product efficiency. One of the best ways to do so is to prevent CHO cell death during bioprocessing. Apoptosis is the most common form of CHO cell death during bioprocessing. Analyzing the complex signaling pathways of apoptosis and the cell cycle is necessary for controlling the growth, efficiency, and death of CHO cells. Therefore, analyzing and understanding the interactions of these pathways with each other and with other cellular processes can help optimize the performance and quality of CHO cell lines. The AI-driven methods and advanced machine learning algorithms used in this project, such as GAT (Graph Attention Network), indicate the most important targets in these complex signaling pathways. Pathways such as the TNF signaling pathway, as well as those associated with viruses such as hepatitis C and HIV-1 and bacteria such as Salmonella, have a high intersection size and a low P-value with the complex signaling pathways. These microorganisms should be considered when designing high-performance CHO cell lines because they are adept at exploiting these pathways. This method can also be used to find novel, high-efficiency targets for treating cancer in humans.